Initial overview of data


This dataset documents cancer death rates for 3047 counties in the United States. It was found on Data World, where Noah Rippner uploaded it as a challenge to predict the outcome variable, TARGET_deathRate (mean cancer mortalities per 100,000 people). His project page is here: https://data.world/nrippner/ols-regression-challenge. In his description he credits the American Community Survey (census.gov), clinicaltrials.gov, and cancer.gov as the sources of the aggregated data set. The dataset was downloaded from Data World as a csv file and loaded into this Rmd file as seen above.



Data summary

Name                   cancer
Number of rows         3047
Number of columns      34
_______________________
Column type frequency:
  character            2
  numeric              32
________________________
Group variables        None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
binnedInc              0              1   16   18      0        10           0
Geography              0              1   16   42      0      3047           0

Variable type: numeric

skim_variable            n_missing  complete_rate       mean         sd        p0       p25       p50       p75         p100
avgAnnCount                      0           1.00     606.34    1416.36      6.00     76.00    171.00    518.00     38150.00
avgDeathsPerYear                 0           1.00     185.97     504.13      3.00     28.00     61.00    149.00     14010.00
TARGET_deathRate                 0           1.00     178.66      27.75     59.70    161.20    178.10    195.20       362.80
incidenceRate                    0           1.00     448.27      54.56    201.30    420.30    453.55    480.85      1206.90
medIncome                        0           1.00   47063.28   12040.09  22640.00  38882.50  45207.00  52492.00    125635.00
popEst2015                       0           1.00  102637.37  329059.22    827.00  11684.00  26643.00  68671.00  10170292.00
povertyPercent                   0           1.00      16.88       6.41      3.20     12.15     15.90     20.40        47.40
studyPerCap                      0           1.00     155.40     529.63      0.00      0.00      0.00     83.65      9762.31
MedianAge                        0           1.00      45.27      45.30     22.30     37.70     41.00     44.00       624.00
MedianAgeMale                    0           1.00      39.57       5.23     22.40     36.35     39.60     42.50        64.70
MedianAgeFemale                  0           1.00      42.15       5.29     22.30     39.10     42.40     45.30        65.70
AvgHouseholdSize                 0           1.00       2.48       0.43      0.02      2.37      2.50      2.63         3.97
PercentMarried                   0           1.00      51.77       6.90     23.10     47.75     52.40     56.40        72.50
PctNoHS18_24                     0           1.00      18.22       8.09      0.00     12.80     17.10     22.70        64.10
PctHS18_24                       0           1.00      35.00       9.07      0.00     29.20     34.70     40.70        72.50
PctSomeCol18_24               2285           0.25      40.98      11.12      7.10     34.00     40.40     46.40        79.00
PctBachDeg18_24                  0           1.00       6.16       4.53      0.00      3.10      5.40      8.20        51.80
PctHS25_Over                     0           1.00      34.80       7.03      7.50     30.40     35.30     39.65        54.80
PctBachDeg25_Over                0           1.00      13.28       5.39      2.50      9.40     12.30     16.10        42.20
PctEmployed16_Over             152           0.95      54.15       8.32     17.60     48.60     54.50     60.30        80.10
PctUnemployed16_Over             0           1.00       7.85       3.45      0.40      5.50      7.60      9.70        29.40
PctPrivateCoverage               0           1.00      64.35      10.65     22.30     57.20     65.10     72.10        92.30
PctPrivateCoverageAlone        609           0.80      48.45      10.08     15.70     41.00     48.70     55.60        78.90
PctEmpPrivCoverage               0           1.00      41.20       9.45     13.50     34.50     41.10     47.70        70.70
PctPublicCoverage                0           1.00      36.25       7.84     11.20     30.90     36.30     41.55        65.10
PctPublicCoverageAlone           0           1.00      19.24       6.11      2.60     14.85     18.80     23.10        46.60
PctWhite                         0           1.00      83.65      16.38     10.20     77.30     90.06     95.45       100.00
PctBlack                         0           1.00       9.11      14.53      0.00      0.62      2.25     10.51        85.95
PctAsian                         0           1.00       1.25       2.61      0.00      0.25      0.55      1.22        42.62
PctOtherRace                     0           1.00       1.98       3.52      0.00      0.30      0.83      2.18        41.93
PctMarriedHouseholds             0           1.00      51.24       6.57     22.99     47.76     51.67     55.40        78.08
BirthRate                        0           1.00       5.64       1.99      0.00      4.52      5.38      6.49        21.33

## # A tibble: 34 × 3
##    variable                n_miss pct_miss
##    <chr>                    <int>    <dbl>
##  1 PctSomeCol18_24           2285    75.0 
##  2 PctPrivateCoverageAlone    609    20.0 
##  3 PctEmployed16_Over         152     4.99
##  4 avgAnnCount                  0     0   
##  5 avgDeathsPerYear             0     0   
##  6 TARGET_deathRate             0     0   
##  7 incidenceRate                0     0   
##  8 medIncome                    0     0   
##  9 popEst2015                   0     0   
## 10 povertyPercent               0     0   
## # … with 24 more rows



The data set contains 3047 observations of 33 feature variables and 1 target variable, TARGET_deathRate (mean cancer mortalities per 100,000 people). Only 3 variables have missing observations: PctSomeCol18_24 (2285 missing), PctPrivateCoverageAlone (609 missing), and PctEmployed16_Over (152 missing). PctSomeCol18_24 is missing about 75% of its observations, so it should be removed from the set of predictors. PctPrivateCoverageAlone is missing about 20% of its observations and PctEmployed16_Over about 5%, so imputation methods can be used to address their missingness.
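
The drop-or-impute reasoning above can be sketched in base R. This is a minimal illustration, not the code used in this report: the data frame `toy` holds made-up values, and the 50% cutoff `drop_cutoff` is an assumed threshold chosen only so that a 75%-missing column is dropped while 5% and 20% columns are imputed.

```r
# Toy stand-in for the cancer data (hypothetical values, not the real dataset).
toy <- data.frame(
  TARGET_deathRate   = c(178.1, 161.2, 195.2, 170.0),
  PctSomeCol18_24    = c(NA, NA, NA, 40.4),      # mostly missing
  PctEmployed16_Over = c(54.5, NA, 60.3, 48.6)   # occasionally missing
)

# Count and percentage of missing values per column.
n_miss   <- colSums(is.na(toy))
pct_miss <- 100 * n_miss / nrow(toy)

# Assumed rule: drop columns missing more than half their values,
# impute columns with some (but less) missingness, keep the rest as is.
drop_cutoff <- 50
action <- ifelse(pct_miss > drop_cutoff, "drop",
                 ifelse(pct_miss > 0, "impute", "keep"))

miss_summary <- data.frame(variable = names(toy), n_miss, pct_miss, action,
                           row.names = NULL)
print(miss_summary)
```

Any cutoff between 20% and 75% would lead to the same decisions for the three variables with missing values in this dataset.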


Most of the missing values for PctEmployed16_Over occur in counties with a large value of PctWhite. A noticeable number of missing values occur at middle values of PctBlack, and a very small number occur at small values of PctAsian.



There is a similar trend for missingness in PctPrivateCoverageAlone. Most of the missing values occur in counties with a large value of PctWhite. A noticeable number of missing values occur at middle values of PctBlack, and a very small number occur at small values of PctAsian.



Additionally, most of the missing values for these same two variables occur with lower values of medIncome.
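
A pattern like this can also be checked numerically by comparing medIncome between rows where a variable is missing and rows where it is observed. The sketch below uses base R and a small made-up data frame `toy`; the values are hypothetical and only show the mechanics of the comparison.

```r
# Toy data (hypothetical values): medIncome plus a variable with gaps.
toy <- data.frame(
  medIncome          = c(30000, 35000, 42000, 50000, 61000, 75000),
  PctEmployed16_Over = c(NA, 48.6, NA, 54.5, 60.3, 62.0)
)

# Flag rows where the variable is missing, then compare the median
# medIncome of the missing group (TRUE) with the observed group (FALSE).
toy$employed_missing <- is.na(toy$PctEmployed16_Over)
income_by_missing <- tapply(toy$medIncome, toy$employed_missing, median)
print(income_by_missing)
```

A lower median in the TRUE group would be consistent with missingness concentrating at lower incomes.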


The missingness is concentrated between values of 30 and 50 for MedianAge.

Essential Findings

Univariate: Response Variable

Important Predictor Variables

It is essential to identify which variables are likely to matter most in developing an accurate and precise model to predict TARGET_deathRate. One approach is simply to examine the data and predict which variables have the greatest association with the response variable. While we hypothesized that factors like income, insurance, and race may be heavily involved in predicting cancer mortalities, there are methods that find patterns and associations between variables more reliably. One such method is the correlation plot. Note that the variables with missing data (PctPrivateCoverageAlone and PctEmployed16_Over) are not represented in the correlation plot and are explored separately. PctSomeCol18_24 is not explored in this EDA because of its extreme missingness, and it will not be considered in our future recipe(s).

From the correlation plot above, we can identify variables whose patterns align with those of the target variable TARGET_deathRate; the first column shows these associations. Here, blue squares mark predictor variables with a positive linear correlation with the response variable, while red squares mark predictor variables with a negative linear correlation. From this, we can see that povertyPercent, PctHS25_Over, PctUnemployed16_Over, PctPublicCoverage, and PctPublicCoverageAlone are the most notably positively correlated with TARGET_deathRate. By contrast, the predictor variables most notably negatively correlated with the response variable are medIncome, PctBachDeg25_Over, and PctPrivateCoverage.
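
The first column of a correlation plot can also be read as a ranked list of correlations with the target. The base R sketch below shows one way to produce such a list while leaving out columns with missing values. The data frame `toy` is simulated with assumed relationships, so its numbers illustrate the mechanics only and are not results from the cancer data.

```r
set.seed(1)  # reproducible simulated data
n <- 200
povertyPercent <- runif(n, 5, 40)

# Simulated columns (hypothetical relationships, not the real dataset).
toy <- data.frame(
  TARGET_deathRate   = 150 + 1.5 * povertyPercent + rnorm(n, sd = 10),
  povertyPercent     = povertyPercent,
  medIncome          = 80000 - 1200 * povertyPercent + rnorm(n, sd = 5000),
  PctEmployed16_Over = c(NA, runif(n - 1, 20, 80))  # has a missing value
)

# Keep only columns with no missing values, as in the correlation plot.
complete_cols <- toy[, colSums(is.na(toy)) == 0, drop = FALSE]

# Pearson correlation of each remaining predictor with the target,
# sorted from most positive to most negative.
cors <- cor(complete_cols)[, "TARGET_deathRate"]
cors <- cors[names(cors) != "TARGET_deathRate"]
print(sort(cors, decreasing = TRUE))
```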

These results make sense once you look past the numbers and consider what the variables mean. The positively correlated predictors generally correspond to characteristics tied to, or resulting from, low income: poverty, a low level of education, unemployment, and government-provided insurance. Another notable, though weaker, positive predictor is PctBlack; factors such as systemic racism may place Black residents in conditions that are not ideal or make it harder for them to get tested and receive treatment. The negatively correlated predictors follow the same logic: higher income, more education, and more extensive private insurance may help prevent as well as treat cancer, leading to fewer cancer mortalities.

It is also interesting to use the correlation plot to identify relationships between predictor variables. While certain relationships are more obvious, such as higher income being associated with more education and less reliance on government-provided healthcare, there are patterns between variables that may be less well known.

From the data above, there is a clear contrast in the association between marriage rates and race: marriage rates are negatively correlated with PctBlack but positively correlated with PctWhite. Because each observation is a county rather than an individual, this is a relationship between county-level percentages and does not by itself describe the marriage rates of individuals of either race. Even so, it may be important to keep in mind when developing models to predict the response variable.

Secondary Findings

This section covers standard variable explorations for the domain area that are unsurprising and conducted mainly out of convention, along with findings that do not seem interesting or important on their own but show some potential.

Univariate: Positive Predictors



For povertyPercent, PctHS25_Over, PctUnemployed16_Over, PctPublicCoverage, and PctPublicCoverageAlone, the boxplots and density plots above show the distributions of the strong positive predictor variables. None of them appears to need a transformation, such as a log transformation; they can be used as is for prediction.
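
A numeric check can complement the visual judgment about transformations. The sketch below computes sample skewness in base R; the helper `skewness()` and the simulated data frame `toy` are assumptions for illustration. The simulated popEst2015 column is included only as a right-skewed contrast and is not drawn from the real data.

```r
# Sample skewness (third standardized moment); base R has no built-in for it.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(42)  # reproducible simulated data
toy <- data.frame(
  povertyPercent = rnorm(500, mean = 17, sd = 6),          # roughly symmetric
  popEst2015     = rlnorm(500, meanlog = 10, sdlog = 1.5)  # heavily right-skewed
)

# Skewness near 0 suggests no transformation is needed; a large positive
# value suggests trying a log transformation and checking again.
skew_raw <- sapply(toy, skewness)
skew_log <- skewness(log(toy$popEst2015))
print(round(skew_raw, 2))
print(round(skew_log, 2))
```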

Bivariate: Positive Predictors


The graphs above support the strong positive relationships between TARGET_deathRate and povertyPercent, PctHS25_Over, PctUnemployed16_Over, PctPublicCoverage, and PctPublicCoverageAlone that were seen in the correlation plot.



These two plots represent the two most positive relationships from the correlation plot. Both show naturally occurring relationships: a larger population results in more cancer-related deaths, even at the same rate as a smaller population. Similarly, a higher number of reported cancer cases per year naturally goes with a higher number of cancer-related deaths.

Univariate: Negative Predictors



For medIncome, PctBachDeg25_Over, and PctPrivateCoverage, the boxplots and density plots above show the distributions of the strong negative predictor variables. None of them appears to need a transformation, such as a log transformation; they can be used as is for prediction.

Bivariate: Negative Predictors



The graphs above support the strong negative relationships between TARGET_deathRate and medIncome, PctBachDeg25_Over, and PctPrivateCoverage that were seen in the correlation plot.



These two plots represent the two most negative relationships from the correlation plot. Both show naturally occurring relationships: a high percentage of private coverage means less reliance on government assistance. Similarly, a high percentage of the county population identifying as white means a low percentage identifying as another race, such as black.



Since PctPrivateCoverageAlone is missing 20% of its data, we separately explored variables that we thought would have a strong naturally occurring relationship with it, such as PctPublicCoverage. The plot above shows a strong negative relationship that could be used for an imputation step in a future recipe.



Since PctEmployed16_Over is missing 5% of its data, we separately explored variables that we thought would have a strong naturally occurring relationship with it, such as PctUnemployed16_Over. The plot above shows a strong negative relationship that could be used for an imputation step in a future recipe.
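
The idea of using a strongly related variable to fill in missing values can be sketched with a simple linear regression. This is a base R stand-in using `lm()`, not the recipe step itself, which has not been chosen yet. The data frame `toy` is simulated with an assumed negative relationship and does not come from the cancer data.

```r
set.seed(7)  # reproducible simulated data
n <- 100
unemp <- runif(n, 2, 20)

# Simulated columns (hypothetical relationship, not the real dataset).
toy <- data.frame(
  PctUnemployed16_Over = unemp,
  PctEmployed16_Over   = 70 - 1.5 * unemp + rnorm(n, sd = 3)
)
toy$PctEmployed16_Over[sample(n, 5)] <- NA  # remove 5% of values at random

# Fit on the complete rows (lm drops rows with NA by default),
# then predict the missing values from the related variable.
fit  <- lm(PctEmployed16_Over ~ PctUnemployed16_Over, data = toy)
gaps <- is.na(toy$PctEmployed16_Over)
toy$PctEmployed16_Over[gaps] <- predict(fit, newdata = toy[gaps, , drop = FALSE])

print(coef(fit))
print(sum(is.na(toy$PctEmployed16_Over)))
```

The same pattern would apply to PctPrivateCoverageAlone with PctPublicCoverage as the predictor.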

Conclusion

From this analysis, we gained a good understanding of the variables whose missingness needs to be addressed in our recipe(s) and model development (2285 missing for PctSomeCol18_24, 152 missing for PctEmployed16_Over, and 609 missing for PctPrivateCoverageAlone). In addition, we investigated relationships between the response and predictor variables as well as among the predictor variables. This allowed us to pinpoint variables of interest that will help us during our imputation step, as well as variables that will be integral to our model for predicting the response variable TARGET_deathRate.